{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 5.1: ChatBot\n",
    "\n",
    "You can use BigDL-LLM to load any Hugging Face *transformers* model for acceleration on your laptop. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) hosted on Hugging Face can be loaded and optimized automatically with low-bit quantizations (supported precisions include INT4/NF4/INT5/INT8).\n",
    "\n",
    "This notebook will dive into the detailed usage of BigDL-LLM `transformers`-style API. In Sections 5.1.2, you will learn how to load a transformers model for different situations. Section 5.1.3 will walk you through the process of building a chatbot using loaded model. You'll start from a simple form, and then add capabilities step by step, e.g. history management (for multi-turn chat) and streaming.  \n",
    "\n",
    "## 5.1.1 Install BigDL-LLM\n",
    "\n",
    "First of all, install BigDL-LLM in your prepared environment. For best practices of environment setup, refer to [Chapter 2](../ch_2_Environment_Setup/README.md) in this tutorial.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --pre --upgrade bigdl-llm[all]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.2 Load Model\n",
    "\n",
    "\n",
    "Now let's load the model. We'll use [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.\n",
    "\n",
    "### 5.1.2.0 Download Llama 2 (7B)\n",
    "\n",
    "To download the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model from Hugging Face, you will need to obtain access granted by Meta. Please follow the instructions provided [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) to request access to the model.\n",
    "\n",
    "After receiving the access, download the model with your Hugging Face token:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import snapshot_download\n",
    "\n",
    "model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf', \n",
    "                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # change it to your own Hugging Face access token"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> The model will by default be downloaded to `HF_HOME='~/.cache/huggingface'`.\n",
    "\n",
    "### 5.1.2.1 Load Model in Low Precision\n",
    "\n",
    "One common use case is to load a Hugging Face *transformers* model in low precision, i.e. conduct **implicit** quantization while loading.\n",
    "\n",
    "For Llama 2 (7B), you could simply import `bigdl.llm.transformers.AutoModelForCausalLM` instead of `transformers.AutoModelForCausalLM`, and specify `load_in_4bit=True` or `load_in_low_bit` parameter accordingly in the `from_pretrained` function. Compared to the Hugging Face *transformers* API, only minor code changes are required.\n",
    "\n",
    "**For INT4 Optimizations (with `load_in_4bit=True`):**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModelForCausalLM\n",
    "\n",
    "model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=\"meta-llama/Llama-2-7b-chat-hf\",\n",
    "                                                     load_in_4bit=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> BigDL-LLM has supported `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`.\n",
    "\n",
    "**For INT8 Optimizations (with `load_in_low_bit=\"sym_int8\"`):**\n",
    "\n",
    "```python\n",
    "# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers\n",
    "model_in_8bit = AutoModelForCausalLM.from_pretrained(\n",
    "    pretrained_model_name_or_path=\"meta-llama/Llama-2-7b-chat-hf\",\n",
    "    load_in_low_bit=\"sym_int8\",\n",
    ")\n",
    "```\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> * Currently, `load_in_low_bit` supports options `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'` or `'sym_int8'`, in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization. Option `'nf4'` is also supported, referring to 4-bit NormalFloat. Floating point precisions `'fp4'`, `'fp8'`, `'fp16'` and mixed precisions including `'mixed_fp4'` and `'mixed_fp8'` are also supported.\n",
    ">\n",
    "> *  `load_in_4bit=True` is equivalent to `load_in_low_bit='sym_int4'`.\n",
    "\n",
    "### 5.1.2.2 Load Tokenizer \n",
    "\n",
    "A tokenizer is also needed for LLM inference. It is used to encode input texts to tensors to feed to LLMs, and decode the LLM output tensors to texts. You can use [Huggingface transformers](https://huggingface.co/docs/transformers/index) API to load the tokenizer directly. It can be used seamlessly with models loaded by BigDL-LLM. For Llama 2, the corresponding tokenizer class is `LlamaTokenizer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import LlamaTokenizer\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path=\"meta-llama/Llama-2-7b-chat-hf\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1.2.3 Save & Load Low-Precision Model (Optional)\n",
    "\n",
    "`from_pretrained` includes a conversion/quantization step, which can be particularly time-consuming or memory-intensive for some large models. To expedite this process, you can use `save_low_bit` API to store the converted model, after the model is loaded first-time using `from_pretrained`. In subsequent uses, you can opt to use the `load_low_bit` instead of `from_pretrained`, which allows for a direct loading of the pre-converted model and speedup the process. The saving and loading process can be done on different machines.\n",
    "\n",
    "\n",
    "**Save Low-Precision Model**\n",
    "\n",
    "Let's take the `model_in_4bit` in section 5.1.2.1 as an example. After we loading Llama 2 (7B) in 4 bit, we could use the `save_low_bit` function to save the optimized model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_directory='./llama-2-7b-bigdl-llm-4-bit'\n",
    "\n",
    "model_in_4bit.save_low_bit(save_directory)\n",
    "del(model_in_4bit)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We recommend saving the tokenizer in the same directory as the optimized model to simplify the subsequent loading process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.save_pretrained(save_directory)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Load Low-Precision Model**\n",
    "\n",
    "We could load the optimized low-bit model through `load_low_bit` function, and load tokenizer from the same saved directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers\n",
    "model_in_4bit = AutoModelForCausalLM.load_low_bit(save_directory)\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained(save_directory)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.3 Run Model\n",
    "\n",
    "BigDL-LLM optimized *transformers* model runs much faster than original model. [Chapter 3 Basic Application Develop](../ch_3_AppDev_Basic/) introduces some basics of using optimized model for direct text completion. In this section we will introduce some advanced usages.\n",
    "\n",
    "### 5.1.3.1 Chat\n",
    "\n",
    "One common application of large language models is Chatbot, where LLMs can engage in interactive conversations. Chatbot interaction is no magic - it still relies on the prediction and generation of next tokens by LLMs. To make LLMs chat, we need to properly format the prompts into a converation format, for example:\n",
    "\n",
    "```\n",
    "<s>[INST] <<SYS>>\n",
    "You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.\n",
    "<</SYS>>\n",
    "\n",
    "What is AI? [/INST]\n",
    "```\n",
    "\n",
    "Further, to enable a multi-turn chat experience, you need to append the new dialog input to the previous conversation to make a new prompt for the model, for example: \n",
    "\n",
    "```\n",
    "<s>[INST] <<SYS>>\n",
    "You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.\n",
    "<</SYS>>\n",
    "\n",
    "What is AI? [/INST] AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images. </s><s> [INST] Is it dangerous? [/INST]\n",
    "```\n",
    "\n",
    "Now we show a multi-turn chat example using official `transformers` API with BigDL-LLM optimized Llama 2 (7B) model. \n",
    "\n",
    "First, define the conversation context format <sup>[1]</sup> for the model to complete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "SYSTEM_PROMPT = \"You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.\"\n",
    "\n",
    "def format_prompt(input_str, chat_history):\n",
    "    prompt = [f'<s>[INST] <<SYS>>\\n{SYSTEM_PROMPT}\\n<</SYS>>\\n\\n']\n",
    "    do_strip = False\n",
    "    for history_input, history_response in chat_history:\n",
    "        history_input = history_input.strip() if do_strip else history_input\n",
    "        do_strip = True\n",
    "        prompt.append(f'{history_input} [/INST] {history_response.strip()} </s><s>[INST] ')\n",
    "    input_str = input_str.strip() if do_strip else input_str\n",
    "    prompt.append(f'{input_str} [/INST]')\n",
    "    return ''.join(prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> <sup>[1]</sup> The conversation context format is referenced from [here](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/323df5680706d388eff048fba2f9c9493dfc0152/model.py#L20) and [here](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/323df5680706d388eff048fba2f9c9493dfc0152/app.py#L9)."
   ]
  },
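  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the layout concrete, the sketch below renders a two-turn conversation into a single prompt string. It is a simplified, self-contained illustration: `render_llama2_prompt`, the shortened system prompt and the sample history are hypothetical, and unlike `format_prompt` it strips every user input.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper mirroring the Llama 2 chat layout shown above;\n",
    "# the name and the shortened system prompt are illustrative only.\n",
    "SYSTEM = 'You are a helpful assistant.'\n",
    "\n",
    "def render_llama2_prompt(user_input, history):\n",
    "    # the system block is emitted once, inside the first [INST] segment\n",
    "    parts = [f'<s>[INST] <<SYS>>\\n{SYSTEM}\\n<</SYS>>\\n\\n']\n",
    "    for past_input, past_response in history:\n",
    "        # each finished turn is closed with </s>, then a new <s>[INST] is opened\n",
    "        parts.append(f'{past_input.strip()} [/INST] {past_response.strip()} </s><s>[INST] ')\n",
    "    parts.append(f'{user_input.strip()} [/INST]')\n",
    "    return ''.join(parts)\n",
    "\n",
    "history = [('What is AI?', 'AI is the study of intelligent machines.')]\n",
    "prompt = render_llama2_prompt('Is it dangerous?', history)\n",
    "print(prompt)\n",
    "```\n",
    "\n",
    "Each finished turn ends with `</s>`, and the next user input opens a new `<s>[INST]` segment, so the prompt grows with every turn."
   ]
  },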
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, define the `chat` function, which continuously adds model outputs to the chat history. This ensures that conversation context can be properly formatted for next generation of responses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chat(model, tokenizer, input_str, chat_history):\n",
    "    # format conversation context as prompt through chat history\n",
    "    prompt = format_prompt(input_str, chat_history)\n",
    "    input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n",
    "\n",
    "    # predict next tokens with stopping_criteria\n",
    "    output_ids = model.generate(input_ids,\n",
    "                                max_new_tokens=32)\n",
    "\n",
    "    output_str = tokenizer.decode(output_ids[0][len(input_ids[0]):], # skip prompt in generated tokens\n",
    "                                  skip_special_tokens=True)\n",
    "    print(f\"Response: {output_str.strip()}\")\n",
    "\n",
    "    # add model output to the chat history\n",
    "    chat_history.append((input_str, output_str))"
   ]
  },
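  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For decoder-only models such as Llama 2, `generate` returns the prompt tokens followed by the newly generated ones, which is why `chat` slices the output before decoding. The sketch below shows the same slicing on plain Python lists; the token ids are made-up stand-ins, not real tokenizer output.\n",
    "\n",
    "```python\n",
    "# Stand-in token ids (plain lists instead of tensors, purely illustrative).\n",
    "prompt_ids = [1, 518, 25580, 29962]             # pretend encoding of the prompt\n",
    "generated_ids = prompt_ids + [15043, 29991, 2]  # pretend generate() output: prompt + new tokens\n",
    "\n",
    "# Dropping the first len(prompt_ids) entries leaves only the newly generated tokens.\n",
    "new_token_ids = generated_ids[len(prompt_ids):]\n",
    "print(new_token_ids)\n",
    "```"
   ]
  },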
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> BigDL-LLM optimized low-bit models are compatible with all Hugging Face *transformers* APIs. Therefore, in addition to using the `generate` function for token prediction, you can also utilize other methods such as the [`TextGenerationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline).\n",
    "\n",
    "Here we go! Let's do interactive, multi-turn chat with LLM:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Input: What is CPU?\n",
      "Response: Hello! I'm glad you asked! CPU stands for Central Processing Unit. It's the part of a computer that performs calculations and executes instructions\n",
      "Input: What is its difference between GPU?\n",
      "Response: Great question! GPU stands for Graphics Processing Unit. It's a specialized type of computer chip that's designed specifically for handling complex graphical\n",
      "Input: stop\n",
      "Chat with Llama 2 (7B) stopped.\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "chat_history = []\n",
    "\n",
    "while True:\n",
    "    with torch.inference_mode():\n",
    "        user_input = input(\"Input:\")\n",
    "        if user_input == \"stop\": # let's stop the conversation when user input \"stop\"\n",
    "          print(\"Chat with Llama 2 (7B) stopped.\")\n",
    "          break\n",
    "        chat(model=model_in_4bit,\n",
    "             tokenizer=tokenizer,\n",
    "             input_str=user_input,\n",
    "             chat_history=chat_history)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1.3.2 Stream Chat\n",
    "\n",
    "Stream chat can be considered as an advanced function for a chatbot, where the response is generated word by word. Here, we define the `stream_chat` function with the help of `transformers.TextIteratorStreamer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import TextIteratorStreamer\n",
    "\n",
    "def stream_chat(model, tokenizer, input_str, chat_history):\n",
    "    # format conversation context as prompt through chat history\n",
    "    prompt = format_prompt(input_str, chat_history)\n",
    "    input_ids = tokenizer([prompt], return_tensors='pt')\n",
    "\n",
    "    streamer = TextIteratorStreamer(tokenizer,\n",
    "                                    skip_prompt=True, # skip prompt in the generated tokens\n",
    "                                    skip_special_tokens=True)\n",
    "\n",
    "    generate_kwargs = dict(\n",
    "        input_ids,\n",
    "        streamer=streamer,\n",
    "        max_new_tokens=128\n",
    "    )\n",
    "    \n",
    "    # to ensure non-blocking access to the generated text, generation process should be ran in a separate thread\n",
    "    from threading import Thread\n",
    "    \n",
    "    thread = Thread(target=model.generate, kwargs=generate_kwargs)\n",
    "    thread.start()\n",
    "\n",
    "    output_str = []\n",
    "    print(\"Response: \", end=\"\")\n",
    "    for stream_output in streamer:\n",
    "        output_str.append(stream_output)\n",
    "        print(stream_output, end=\"\")\n",
    "\n",
    "    # add model output to the chat history\n",
    "    chat_history.append((input_str, ''.join(output_str)))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> To successfully observe the text streaming behavior in standard output, we need to set the environment variable `PYTHONUNBUFFERED=1 `to ensure that the standard output streams are directly sent to the terminal without being buffered first.\n",
    ">\n",
    "> The [Hugging Face *transformers* streamer classes](https://huggingface.co/docs/transformers/main/generation_strategies#streaming) is currently being developed and is subject to future changes.\n",
    "\n",
    "We can then achieve interactive, multi-turn stream chat between humans and the bot by allowing continuous user input as before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Input: What is AI?\n",
      "Response:  Hello! I'm glad you asked! AI, or Artificial Intelligence, is a field of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images, making decisions, and solving problems.\n",
      "\n",
      "AI technology has been rapidly advancing in recent years, and it has many applications in various industries, including:\n",
      "\n",
      "1. Healthcare: AI can help doctors and medical professionals analyze medical images, diagnose diseases, and develop personalized treatment plans.\n",
      "2. Finance: AI\n",
      "Input: Is it dangerous?\n",
      "Response:  As a responsible and ethical AI language model, I must emphasize that AI, like any other technology, can be used for both beneficial and harmful purposes. It is important to recognize that AI is a tool, and like any tool, it can be used for good or bad.\n",
      "\n",
      "There are several potential risks and challenges associated with the development and use of AI, including:\n",
      "\n",
      "1. Bias and discrimination: AI systems can perpetuate and amplify existing biases and discrimination if they are trained on biased data or designed with a particular worldview\n",
      "Input: stop\n",
      "Stream Chat with Llama 2 (7B) stopped.\n"
     ]
    }
   ],
   "source": [
    "chat_history = []\n",
    "\n",
    "while True:\n",
    "    with torch.inference_mode():\n",
    "        user_input = input(\"Input:\")\n",
    "        if user_input == \"stop\": # let's stop the conversation when user input \"stop\"\n",
    "          print(\"Stream Chat with Llama 2 (7B) stopped.\")\n",
    "          break\n",
    "        stream_chat(model=model_in_4bit,\n",
    "                    tokenizer=tokenizer,\n",
    "                    input_str=user_input,\n",
    "                    chat_history=chat_history)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1.4 What's Next？\n",
    "\n",
    "In the next tutorial, we will guide you through a speech recognition pipeline that incorporates BigDL-LLM INT4 optimizations."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
